The Design of Syntactic Annotation Levels in the National Corpus of Polish
Abstract
This paper presents the procedure of the syntactic annotation of the National Corpus of Polish. Syntactic annotation consists here of shallow parsing and manual post-editing of the results by annotators. The description concentrates on the delimitation of syntactic words and groups, as well as on problems encountered during the annotation process.
Similar resources
An annotation scheme for Persian based on Autonomous Phrases Theory and Universal Dependencies
A treebank is a corpus with linguistic annotations above the level of the parts of speech. During the first half of the present decade, three treebanks have been developed for Persian either originally or subsequently based on dependency grammar: Persian Treebank (PerTreeBank), Persian Syntactic Dependency Treebank, and Uppsala Persian Dependency Treebank (UPDT). The syntactic analysis of a sen...
Recent Developments in the National Corpus of Polish
The aim of the paper is to present recent — as of July 2009 — developments in the construction of the National Corpus of Polish. The main developments are: 1) the design of text encoding XML schemata for various levels of linguistic information, 2) a new tool for manual annotation at various levels, 3) numerous improvements in search tools.
Syntactic processing of the IPI PAN Corpus of Polish
The aim of this paper is to present recent and ongoing work on adorning the IPI PAN Corpus of Polish (Przepiórkowski 2004, 2006a) with partial syntactic annotation, with the ultimate aim of building a treebank of Polish. The work described here is a part of the project Automatic extraction of linguistic knowledge from a large corpus of Polish (a Ministry of Education and Science grant number 3T...
On Heads and Coordination in Valence Acquisition
The aim of this paper is to present the design of a partial syntactic annotation of the IPI PAN Corpus of Polish [22] and the corresponding extension of the corpus search engine Poliqarp [25,12] developed at the Institute of Computer Science PAS and currently employed in Polish and Portuguese corpora projects. In particular, we will argue for the need to distinguish between, and represent both, ...
Lexicons and Grammars for Named Entity Annotation in the National Corpus of Polish
We present initial results in the named entity annotation subtask of a project aiming at creating the National Corpus of Polish. We summarize the annotation requirements defined for this corpus, and we discuss how existing lexical resources and grammars for Polish named entities have been adapted to meet those requirements. We show first results of the corpus annotation using the information extra...